Modern autonomous driving system is characterized as modular tasks in sequential order, i.e., perception, prediction and planning. As sensors and hardware get improved, there is trending popularity to devise a system that can perform a wide diversity of tasks to fulfill higher-level intelligence. Contemporary approaches resort to either deploying standalone models for individual tasks, or designing a multi-task paradigm with separate heads. These might suffer from accumulative error or negative transfer effect. Instead, we argue that a favorable algorithm framework should be devised and optimized in pursuit of the ultimate goal, i.e. planning of the self-driving-car. Oriented at this goal, we revisit the key components within perception and prediction. We analyze each module and prioritize the tasks hierarchically, such that all these tasks contribute to planning (the goal). To this end, we introduce Unified Autonomous Driving (UniAD), the first comprehensive framework up-to-date that incorporates full-stack driving tasks in one network. It is exquisitely devised to leverage advantages of each module, and provide complementary feature abstractions for agent interaction from a global perspective. Tasks are communicated with unified query design to facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. With extensive ablations, the effectiveness of using such a philosophy is proven to surpass previous state-of-the-arts by a large margin in all aspects. The full suite of codebase and models would be available to facilitate future research in the community.
translated by 谷歌翻译
Pre-trained language models for programming languages have shown a powerful ability on processing many Software Engineering (SE) tasks, e.g., program synthesis, code completion, and code search. However, it remains to be seen what is behind their success. Recent studies have examined how pre-trained models can effectively learn syntax information based on Abstract Syntax Trees. In this paper, we figure out what role the self-attention mechanism plays in understanding code syntax and semantics based on AST and static analysis. We focus on a well-known representative code model, CodeBERT, and study how it can learn code syntax and semantics by the self-attention mechanism and Masked Language Modelling (MLM) at the token level. We propose a group of probing tasks to analyze CodeBERT. Based on AST and static analysis, we establish the relationships among the code tokens. First, Our results show that CodeBERT can acquire syntax and semantics knowledge through self-attention and MLM. Second, we demonstrate that the self-attention mechanism pays more attention to dependence-relationship tokens than to other tokens. Different attention heads play different roles in learning code semantics; we show that some of them are weak at encoding code semantics. Different layers have different competencies to represent different code properties. Deep CodeBERT layers can encode the semantic information that requires some complex inference in the code context. More importantly, we show that our analysis is helpful and leverage our conclusions to improve CodeBERT. We show an alternative approach for pre-training models, which makes fully use of the current pre-training strategy, i.e, MLM, to learn code syntax and semantics, instead of combining features from different code data formats, e.g., data-flow, running-time states, and program outputs.
translated by 谷歌翻译
Audio-visual speech recognition (AVSR) has gained remarkable success for ameliorating the noise-robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on audio modality as it is much easier to recognize than video modality in clean conditions. As a result, the AVSR model underestimates the importance of visual stream in face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages the MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art in both clean and various noisy conditions. Furthermore, we demonstrate the better generality of MSRL system than other baselines when test set contains unseen noises.
translated by 谷歌翻译
Fully convolutional detectors discard the one-to-many assignment and adopt a one-to-one assigning strategy to achieve end-to-end detection but suffer from the slow convergence issue. In this paper, we revisit these two assignment methods and find that bringing one-to-many assignment back to end-to-end fully convolutional detectors helps with model convergence. Based on this observation, we propose {\em \textbf{D}ual \textbf{A}ssignment} for end-to-end fully convolutional de\textbf{TE}ction (DATE). Our method constructs two branches with one-to-many and one-to-one assignment during training and speeds up the convergence of the one-to-one assignment branch by providing more supervision signals. DATE only uses the branch with the one-to-one matching strategy for model inference, which doesn't bring inference overhead. Experimental results show that Dual Assignment gives nontrivial improvements and speeds up model convergence upon OneNet and DeFCN. Code: https://github.com/YiqunChen1999/date.
translated by 谷歌翻译
很少有学习模型学习人类注释有限,而这种学习范式在各种任务中证明了实用性数据使该模型无法充分探索语义信息。为了解决这个问题,我们将知识蒸馏引入了几个弹出的对象检测学习范式。我们进一步进行了激励实验,该实验表明,在知识蒸馏的过程中,教师模型的经验误差将少数拍物对象检测模型的预测性能(作为学生)退化。为了了解这种现象背后的原因,我们从因果理论的角度重新审视了几个对象检测任务上知识蒸馏的学习范式,并因此发展了一个结构性因果模型。遵循理论指导,我们建议使用基于后门调整的知识蒸馏方法,用于少数拍物检测任务,即Disentangle和Remerge(D&R),以对相应的结构性因果模型进行有条件的因果干预。从理论上讲,我们为后门标准提供了扩展的定义,即一般后门路径,可以在特定情况下扩展后门标准的理论应用边界。从经验上讲,多个基准数据集上的实验表明,D&R可以在几个射击对象检测中产生显着的性能提升。
translated by 谷歌翻译
由于其在多个工业应用领域的竞争性能,深度学习在我们的日常生活中起着越来越重要的作用。作为基于DL的系统的核心,深度神经网络会自动从精心收集和有组织的培训数据中学习知识,以获得预测看不见数据的标签的能力。与需要全面测试的传统软件系统类似,还需要仔细评估DNN,以确保受过训练的模型的质量满足需求。实际上,评估行业中DNN质量的事实上的标准是检查其在收集的标记测试数据集中的性能(准确性)。但是,准备这样的标记数据通常不容易部分,部分原因是标签工作巨大,即数据标记是劳动密集型的,尤其是每天有大量新的新传入的未标记数据。最近的研究表明,DNN的测试选择是一个有希望的方向,可以通过选择最小的代表性数据来标记并使用这些数据来评估模型来解决此问题。但是,它仍然需要人类的努力,不能自动。在本文中,我们提出了一种名为Aries的新技术,可以使用原始测试数据获得的信息估算新未标记数据的DNN的性能。我们技术背后的关键见解是,该模型在与决策边界具有相似距离的数据上应具有相似的预测准确性。我们对13种数据转换方法的技术进行了大规模评估。结果表明,我们技术的有用性是,白羊座的估计准确性仅为0.03%-2.60%(平均0.61%),从真实的准确性中差。此外,在大多数(128个)情况下,白羊座还优于最先进的选择标记方法。
translated by 谷歌翻译
联合学习(FL)和分裂学习(SL)是两种新兴的协作学习方法,可能会极大地促进物联网(IoT)中无处不在的智能。联合学习使机器学习(ML)模型在本地培训的模型使用私人数据汇总为全球模型。分裂学习使ML模型的不同部分可以在学习框架中对不同工人进行协作培训。联合学习和分裂学习,每个学习都有独特的优势和各自的局限性,可能会相互补充,在物联网中无处不在的智能。因此,联合学习和分裂学习的结合最近成为一个活跃的研究领域,引起了广泛的兴趣。在本文中,我们回顾了联合学习和拆分学习方面的最新发展,并介绍了有关最先进技术的调查,该技术用于将这两种学习方法组合在基于边缘计算的物联网环境中。我们还确定了一些开放问题,并讨论了该领域未来研究的可能方向,希望进一步引起研究界对这个新兴领域的兴趣。
translated by 谷歌翻译
通道修剪被广泛用于降低深网模型的复杂性。最近的修剪方法通常通过提出通道重要性标准来识别网络的哪些部分。但是,最近的研究表明,这些标准在所有情况下都不能很好地工作。在本文中,我们提出了一种新颖的功能最小化方法(FSM)方法来压缩CNN模型,该模型通过收敛功能和过滤器的信息来评估特征转移。具体而言,我们首先使用不同层深度的一些普遍方法研究压缩效率,然后提出特征转移概念。然后,我们引入了一种近似方法来估计特征移位的幅度,因为很难直接计算它。此外,我们提出了一种分布优化算法,以补偿准确性损失并提高网络压缩效率。该方法在各种基准网络和数据集上产生最先进的性能,并通过广泛的实验验证。这些代码可以在\ url {https://github.com/lscgx/fsm}上可用。
translated by 谷歌翻译
在本文中,我们介绍了Siammask,这是一个实时使用相同简单方法实时执行视觉对象跟踪和视频对象分割的框架。我们通过通过二进制细分任务来增强其损失,从而改善了流行的全面暹罗方法的离线培训程序。离线训练完成后,SiamMask只需要一个单个边界框来初始化,并且可以同时在高框架速率下进行视觉对象跟踪和分割。此外,我们表明可以通过简单地以级联的方式重新使用多任务模型来扩展框架以处理多个对象跟踪和细分。实验结果表明,我们的方法具有较高的处理效率,每秒约55帧。它可以在视觉对象跟踪基准测试中产生实时最新结果,同时以高速进行视频对象分割基准测试以高速显示竞争性能。
translated by 谷歌翻译
在本报告中,我们在CVPR 2022的Waymo Open数据集挑战中介绍了解决方案和流程预测挑战,该挑战在排行榜上排名第一。我们已经开发了一个新型的层次空间时间网络,该网络具有时空编码器,一个富含潜在变量的多尺度聚合器以及一个递归层次结构3D解码器。我们使用多种损失,包括局灶性损失和修改的流量损失来有效指导训练过程。我们的方法达到了一个占地0.8389的流动占用AUC,并且优于排行榜上所有其他团队。
translated by 谷歌翻译